fix: deflake //rs/tests/node:guestos_no_failed_systemd_units by basvandijk · Pull Request #10037 · dfinity/ic

basvandijk · 2026-04-27T16:50:44Z

Root cause

In the last week, //rs/tests/node:guestos_no_failed_systemd_units flaked twice
on master / PR branches. In both cases the assertion in
rs/tests/node/guestos_no_failed_systemd_units.rs:40 panicked because
systemctl list-units --failed on the GuestOS node reported:

● reload_nftables.service loaded failed failed Reload nftables when the configuration changes
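
For illustration, the condition the test trips on can be approximated in shell. This is only a sketch: the real assertion lives in the Rust test and is not reproduced here, `failed_units` is a hypothetical helper name, and the column layout (optional bullet, UNIT, LOAD, ACTIVE, SUB, DESCRIPTION) is assumed to match the line above.

```shell
# Hypothetical helper: read `systemctl list-units --failed`-style output on
# stdin and print the name of every unit whose ACTIVE column is "failed".
failed_units() {
    # A leading "●" marker shifts the columns by one, so pick the unit
    # field index accordingly; ACTIVE is two fields after the unit name.
    awk '{ i = ($1 == "●") ? 2 : 1; if ($(i + 2) == "failed") print $i }'
}

# Possible use on a node (not run here):
#   systemctl list-units --failed | failed_units
```

Header, blank, and summary lines never have "failed" in the ACTIVE position, so they are skipped; any name printed corresponds to a unit the test would complain about.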

reload_nftables.service is a path-activated Type=oneshot unit that runs
nft flush ruleset && systemctl reload nftables.service whenever the
orchestrator-generated /run/ic-node/nftables-ruleset/nftables.conf changes.
A previous fix (#9857) added After=nss-lookup.target because
nft resolves the hostos hostname via the nss_icos NSS plugin and would
fail with "Could not resolve hostname" if name resolution wasn't ready yet.

That ordering is necessary but not sufficient: any transient failure of
systemctl reload nftables.service (e.g. nftables.service momentarily
restarting, a brief nss_icos hiccup) leaves this oneshot unit stuck in the
failed state. Because the unit is purely path-activated and the orchestrator
only rewrites nftables.conf on content change, there is no automatic retry,
so the failure persists until the next content change — long enough for the
test assertion to observe it.

Fix

Retry the reload up to 5 times with a 2s sleep in between. This absorbs
transient failures and lets the unit finish successfully without changing the
overall semantics (still Type=oneshot, still triggered by the path unit).
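
A minimal sketch of the retry shape described above, assuming a POSIX shell wrapper. The helper name `retry` and its argument order are hypothetical, and this page does not say whether the unit retries only the reload or the whole flush-and-reload command; the diff is the authority on the actual ExecStart line.

```shell
# Hypothetical helper: run a command up to MAX_ATTEMPTS times, sleeping
# DELAY seconds between failed attempts.
# Usage: retry MAX_ATTEMPTS DELAY COMMAND [ARGS...]
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        # Success on any attempt ends the loop with exit status 0.
        if "$@"; then
            return 0
        fi
        # Out of attempts: report and fail, so systemd still marks the
        # oneshot unit as failed when the problem is not transient.
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# With the numbers from this PR, an invocation might look like (assumption,
# not run here):
#   retry 5 2 sh -c 'nft flush ruleset && systemctl reload nftables.service'
```

Because the helper still returns non-zero once all attempts are exhausted, a persistent failure continues to surface in systemctl list-units --failed; only transient ones are absorbed.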

Verification

Ran 3 parallel iterations locally; all passed:

bazel test --test_output=errors --runs_per_test=3 --jobs=3 \
  //rs/tests/node:guestos_no_failed_systemd_units
# 1 of 1 test passed (3 runs).

PR was created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.

Retry the nftables reload up to 5 times in reload_nftables.service to absorb transient failures (e.g. nss_icos hostname resolution hiccups, or nftables.service momentarily restarting). Without a retry, a single transient failure leaves this oneshot unit stuck in the failed state until the orchestrator rewrites nftables.conf again, which only happens on content change. The guestos_no_failed_systemd_units test then panics because 'systemctl list-units --failed' reports reload_nftables.service.

github-actions Bot added the fix label Apr 27, 2026

basvandijk wants to merge 1 commit into master from ai/deflake-guestos-no-failed-systemd-units-2026-04-27
